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1—5 . Abstract 

■ Many experiments in medicine and ecology can be conveniently modelled by finite 
Gaussian mixtures but face the problem of dealing with small data sets. We propose 

Q I a robust version of the estimator based on self-regression and sparsity promoting 

^ ' penalization in order to estimate the components of Gaussian mixtures in such 

contexts. A space alternating version of the penalized EM algorithm is obtained 
and we prove that its cluster points satisfy the Karush-Kuhn- Tucker conditions. 
Monte Carlo experiments are presented in order to compare the results obtained 
by our method and by standard maximum likelihood estimation. In particular, our 
^ , estimator is seen to perform well better than the maximum likelihood estimator. 

(N 

^ I Key words: finite Gaussian mixtures, maximum likelihood estimation, Kullback 

Tj- ■ Proximal Point algorithms, EM algorithm, 11 penalization, LASSO, sparsity, 

regression mixtures, model based clustering 

o 

O 

Finite Gaussian mixture models are widely used in a great number of appli- 
^ . cation fields as a means to perform model based classification. From pattern 

recognition to biology, from quality control to finance, many examples have 
shown the pertinence of the Gaussian mixture model approach. The book [13] 
is the most comprehensive reference for finite non necessarily Gaussian mix- 
ture models with many application examples. In Gaussian mixture models, 
the data Yi, . . . , 1^ are assumed i.i.d. and to be drawn from the density 
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where the vector 9* = {pi, . . . nl, . . . , E^, . . . , E^) is an unknown mul- 
tidimensional parameter. To this model, we traditionally associate an extended 
model using the notion of complete data. In mixture models, the complete 
data are independent and identically distributed couples of the form {Yi,Zi) 
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where Zi is a multinomial random variable taking values in {1, . . . , i^T} with 
P{Zi = k) = ttI and which represents the index of the mixture component 
from which observation i was drawn. We assume that conditionally on the 

event Zi = k, Y, has density -j=^== exp ^{x - fil)J:i-\x - fil)^. The 
variables Zi, . . . , Zn being unobserved, they are usually called latent variables. 

The standard approach for estimating 6* is the maximum likelihood method- 
ology which consists of finding 9 which maximizes the log-likelihood function 

i=i ^k=i y(27r)'^det(Sfc) ^ 2 V 

over the set 9 = {(tti, . . . , tt^, /xi, . . . , ^k, Ei, . . . , Ex) | e M+, fik e 
M'^, Sfc e Sj", and Y.k=i'^k = 1} where Sj" denotes the set of all symmet- 
ric positive semidefinite matrices. 

Interestingly enough, the supremum of the log-likelihood function over 9 is 
equal to -|-oo and is obtained for singular covariance matrices. A study of 
the one dimensional case was made in ^2j. Thus, exact likelihood maximiza- 
tion is not the good approach for this problem. However, many researchers 
and practitioners have noticed that some local maximizer of the log-likelihood 
function is in fact consistent in practice. From the numerical viewpoint, lo- 
cal maximizers of the log-likelihood function are usually obtained using the 
Expectation-Maximization (EM) algorithm of Dempster Laird and Rudin [8j. 
This algorithm is a nice procedure with closed form expression of each itera- 
tion in the Gaussian mixture case. The EM algorithm for mixture models is 
available in the MIXMOD package [1] within Matlab or Scilab for instance. 

Beside the question of finding the right local optimizer of the likelihood func- 
tion, one of the main problems for estimating 9* relies in having a sufficiently 
great sample size. Usually, large sample sizes may sometimes be available in 
a number of applications such as pattern recognition or financial time series 
analysis but in ecology for instance the sample size may be very small in sit- 
uations where finite mixture models are suspected to be very pertinent due 
to the biological context. The goal of this paper is to remedy this problem by 
proposing a new methodology for Gaussian mixture model estimation in the 
case where the sample size is extremely small. Our approach needs to provide 
a certain amount of robustness. In the same spirit as for the median in the 
one dimensional case, the main idea is to express the estimators of the ^^'s 
as a combination of a small number of data in the middle of each cluster. 
This is simply done by restricting the search to the data's span, i.e. to obtain 
the /ifc's as a regression with covariates the data themselves and to impose an 
additional sparsity constraint on the regression vectors. In order to simplify 
the analysis, we will assume the covariance matrices to be of the form all. 
The CTfc's and the tt^'s can also be estimated using for instance a maximum 
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likelihood approach conditioned on the estimated value of the /i^'s. 

The wole procedure is formally equivalent to joint variable selection and es- 
timation in a mixture of regression model. Variable selection and estimation 
are performed using /i-penalized EM steps which reduce the complexity of the 
regression model just as for the LASSO [Hj. Encouraging simulations results 
show that the proposed approach correctly estimates the class of 8 over 10 
points on average for a mixture of 3 Gaussians in dimension two. Monte carlo 
experiments are performed for samples sizes of 10 points and dimension grow- 
ing up to to 50 showing a good behavior of the method which outperforms the 
standard maximum likelihood estimator. 



1 Presentation of the method 

1.1 Recalls on regression mixtures and model selection with nondifferentiahle 
penalties 

In order to introduce our method, we will need some preliminary notions on 
mixtures of regressions and the LASSO algorithm for sparse variable selection 
in regression models. 

1.1.1 Regression mixtures 

The Gaussian regression mixture assumes that the observations are couples of 
the form {Y,X) where X takes its values in W and conditionally on X, the 
real valued random variable Y follows the mixture density 

fY\x{y) = j:,PkJ^{Xh.^l). (3) 

k=l 

Such mixture models are frequent in econometrics and chemometrics as de- 
scribed in the introduction of [Tlj. Estimation in these models can be per- 
formed using likelihood maximization as in [13] using the EM algorithm or 
a Bayesian methodology as studied in [H] using Monte Carlo Markov Chain 
techniques. 

1.1.2 Variable and model selection with nondifferentiahle penalties 

An important problem in the study of regression mixtures is the one of variable 
selection. This problem has been recently addressed in ^12j and several other 
works using various penalties based on nondifferentiahle norm-like functions 
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of the regression vectors like the h norm as in the LASSO [H] or SCAD 
introduced by [9]. 

The extension of the LASSO technique to the present situation consists of 
penahzing the log-hkehhood hke function / given by 

n , K -I -1 N 

¥} - glog exp (--(r.-X&)E.-'(y.-m))), (4) 

by the sum of the /i-norms of the regression vectors in order to obtain the 
following estimator 

K 

^ G argmaxeg:e/(6') - (5) 

k=l 

where A is a parameter to be adapted to the sample size. In the case of one 
regression with known variance cr^ instead of a mixture, the recent theory of 
the LASSO prescribes to take A proportional to y^21og(p)a^ even in the case 
where n <^ p; see for instance |3j. On the other hand, the paper |l2j studies 
the case of proper mixtures with unknown variances but restricts the analysis 
to n ^ p and p is constant as a function of n. A lot of work still remains to 
be done in the case where n <^ p and the variance is unknown. 



1.2 Our proposal: The mixture of self-regression with sparsity constraint 
1.2.1 The estimator 

In all what follows, we will assume that the data have been centered. Our 
proposal relies on the following simple idea: if the cluster probabilities 7r|'s, 
the class indices Zj's and the variances cr^'s where known ahead of time, the 
estimators of the /ifc's could be chosen, in the small sample setting, as linear 
combinations of the data vectors themselves, in the same spirit as the median 
is preferable to the mean based on the theory of robust statistics. However, 
chosing the right sample vectors to estimate the mean seems a priori to be a 
very hard task. The main ingredient in our proposal is to force our estimator 
to be very sparsely represented as a linear combination of the data. Thus, the 
estimation of the mean vectors can be seen as a estimating a mixture of sparse 
regressions. The problem of estimating the right covariates being NP-hard in 
general, the estimation can be performed using with a penalization enforcing 
sparsity just as the h penalty in the LASSO. 

In the more general case where the indices Zi are unobserved, and the clus- 
ter probabilities tt^'s and the variances a^'s are unknown, one can consider 
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maximizing the /i-penalized log-likelihood like function given by 



under the data-driven constraints fik = Y(3k for = 1, . . . , n where the matrix 
Y is given by y = [Yi, . . . , Y^^. In other words, we would like to maximize the 
/i-penalized likelihood like function 

lr.en{9) = ~m-\njZ\mi, (7) 
fc=l 

where 

1(9) = 2^ log ( 1^ Tik exp ( —2 



i.^.^ r/ie Space- Alternating ll-EM algorithm 

Optimizing the /i-penalized function (Ej) can be performed using an EM-type 
algorithm. The Expectation Step consists of computing the conditional expec- 
tation of the complete /i-penalized likelihood like function given the observa- 
tions Yi, . . . , F„ where the distribution of the latent variables is taken to be 
their marginal density parametrized by the approximation 9 of the true pa- 
rameter 6*. The resulting quantity is traditionally denoted by Q{6,9) and we 
will use the same notation in our /i-penalized context. We will use the general 
form of covariance matrices instead of all in order to present a more general 
form of the method. 

More precisely, the complete /i-penalized log-likelihood like function Ipeni.^)^ 
i.e. the penalized log-likelihood like function of the complete data 
. . . , (Fn, Zn) is given by 

7c in, ( 1 / {Y,-YPk)\Y,-YPk) ^\ 



Thus, we obtain 



n K 



1 / {Y.,-Yl3k)\Y.,-YPk) 



QiO, ^1 = E E log , exp - '-^ ^ r,,k (10) 

i=ik=i ^ JC^T^^Y 



where we used the standard notation Ti^k = Pe{Zi = k \ Yi, . . . ,Yn 
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The Maximization Step consists of maximizing 



Et, log ( vr.^^ exp ( - (^-Wm-m) ) _ ^j^k^^ ||^^||^. 



= ^„ -y^" ■ (12) 



(11) 

In order to make the computational part easy, the tt^'s, /^^'s and a^s can 
be optimized alternatively in the manner of the Gauss-Seidel approach. In 
fact, the separability of the problem into two subproblems, the first being 
optimization over the tt^'s and the second being optimization over the P^s 
and (Tfc's is already well known and the solution of the first subproblem is of 
the form ^ 

J2i=l Ti,k 
i=l 2^k=l 'i,k 

On the other hand, joint optimization in /3fc's and the (Jfc's is not separable 
and space alternating option is necessary to keep the computational complex- 
ity of each step at a low level. In order to address this problem, we need a 
generalization of the EM algorithm allowing for componentwise optimization 
at each step. Such penalized EM algorithms have been recently studied in the 
broader framework of Space Alternating Kullback Proximal Point Algorithms 
in [7]. Optimizing successively over the /3fc's at one iteration and over the a^s 
at the next iteration should be sufficiently efficient in practice. Here, we will 
also optimize one cluster at a time in order to obtain the injectivity conditions 
which are needed in the theoretical analysis of the algorithm. A simple way 
to accelerate this extreme version of the Gauss-Seidel methodology would be 
to average the new iterates and a^''^ with the previous respective iterates 
in order to smooth the algorithm's trajectory. 



2 Convergence analysis 



When using a maximum likelihood approach, incorporation of a nondiffer- 
entiable penalty in the EM algorithm may cause some technical difficulties. 
A rigourous analysis has been proposed in [7] in the case of general nondif- 
ferentiable penalties and space alternating optimization versions of the EM 
algorithm. The convergence analysis is made easier after interpreting the EM 
algorithm as a Proximal Point Algorithm which was first done in [5] (see also 
[6] for more precise results). 

In our special case, we only need to show that our Space- Alternating U-EM is 
a Space- Alternating Kullback Proximal Point Algorithm of the form studied 
in [7]E]. 

^ for a definition of the Clarke subdifferential, see the Appendix of [7j 
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Algorithm 1 Space- Alternating /i-EM algorithm 



Input L eN^ 

Choose intial iterate ^(o) = (^W^ . . . ^ 7rJ\ Z^^, ■ ■ ■ , PP , , ...^af] 
1= 1 

while / < L do 

(E-Step) Compute the conditional probabilities Pg(i-i)(Zj = k\Y) given 
the observations Fi, . . . , 1^ for i = 1, . . . , n and A; = 1, . . . , ii' using the 
following formula 

{i-i) 1 ^/ (y,-y4'~'')*(yi-y/3['-^)) 



-(0 



compute 

-either the 7r^''''s by the formula 

(I) l^i=l 'i,k /. 

-or as the solution of the LASSO-like optimization problem 

/3f e argmin^,K„||yr« - - A||/5||i. (15) 

for the index k updated in cyclic order along iterations, 
-or erf'' using the formula 

4" = -r^i:i|y.-nr"iiM2. (16) 

for one index k updated in cyclic order along iterations, 
cyclically 
end while 

Output 'n'^\ PI^^ and a^^^ for k = 1, . . . , K. 

Proposition 1 The space alternating ll-EM algorithm is a Space Alternating 
Kullback Proximal Point Algorithm as defined in Definition 2.1.1. of f^, i.e. 
the iterations can be written\^ 

e'^' = argmax,,e,.,,„„, ,,,,nD,^..,nD^ JiO) -Jl KPniMO)) - IyiO,e'). (17) 

r=l 

where i/j: W Si x ■ ■ ■ x Sr be a continuously differentiable mapping, X 



2 The relaxation sequence of Definition 2.1.1. of [7] is taken to be constant and 
equal to one 
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be a positive real vector in M.^, pn be a possibly nonsmooth penalty function 
with bounded Clarke subdifferential on compact sets and the parameter space 
is decomposed into subspaces 9^ = 6 fl Sr, r = 1, . . . , R where Si, . . . ,SFt are 
subspaces ofW and MP = ®^^^Sr. 

Proof. First, we adopt the decomposition of the parameter space into the 
cartesian product of the vr^'s space, the P^s space and the a^s space. More 
precisely 9i is the simplex in and Si = M^, 02, fc = M*^ = S2,k, and 
63 A: = IR+ and S3 k = M for k = 1, . . . , K . Thus r takes its values in the list 
{1,'(2,1),...,(2,K),(3,1),...,(3,K)}. 

Then the mappings are just the orthogonal projections onto Sr for r G 
{1, (2, 1), ... , (2, K), (3, 1), . . . , (3, K)}. Moreover Ai = and A(3,fc) = for A; = 
1, . . . ,K because the class probabilities and the variances are not penalized. 
Moreover A(2,a;) = A for A; = 1, . . . , i^'. 

Next, the Q-function can be written [1] 

Q{e,9) = li9)-Iy{9,9) (18) 



with 



where 



w) = iEUkmogQ^). (19) 



i=l k=l 



tik{0) = (20) 

Thus, the space alternating LASSO-EM algorithm is a special case of the 
Space Alternating Kullback Proximal Point Algorithm for which the sequence 
{fJ'k)km is constant and the terms are all equal to one. □ 

In the following, we will denote the list of values for the parameter r hy TZ = 
{1, (2, 1), ... , (2, iC), (3, 1), . . . , (3, K)}. We then have the following theorem. 

Theorem 2 Let 6* be a cluster point of the Space Alternating Penalized Kull- 
back Proximal sequence. If 9* lies in the interior of Dudei, then 9* satisfies the 
following property: there exists a set of subsets I** C /* where I* denotes the 
index of the active constraints at 9* , i.e. /* = {{i,j) s.t. tij{9*) = 0, and there 
is a family of real numbers Xij, {i,j) G X** , r ^ TZ such that the following 
Karush-Kuhn- Tucker condition for optimality holds at cluster point 9* : 

e vi{9*) - J2 Kdp^iMo*)) + E ^r,vt.,(r). 



see Section 4.1 of |7j for more details. 
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Proof. We start by verifying that Assumptions 2.2.1, 2.2.3 and Assumptions 
2.2.4 of [7] hold in our case. The differentiabihty requirement in Assumptions 
2.2.1.(i). is obvious. However, if one [3k belongs to the kernel of X, it may be 
of any arbitrary large norm without leading the log-likelihood towards — oo. 
However, note that, as is well known is Gaussian mixture models, / tends to 
+CXD only at finite number of degenerate points. Thus, since, the penalization 
term Pn tends to +oo as the norm of any [3k tends to +cxd, the difference 
/ — A2Pn(^2(')) tends to —oo if the norm of any [3k goes to +oo. Since I also 
tends to —oo as any variance tends to +oo, the term / — \2Pn tends to —oo 
when the norm of 9 tends to +oo. 

The domain Di is defined by the fact that the term inside the log in ((21) must 
be positive. On the other hand, for any ^ in 9 = 6i x 62,1 x ■ ■ ■ x Q2,k x 
63 1 X ■ ■ ■ X 03^;^, the domain D^^ q is the set of the ^'s for which the UkiO) 
are positive, and therefore, does not depend on 9. Moreover, the set of 6''s for 
which the tik{0) are positive is Di. Thus, the projection of Di onto the first 
coordinate is Di and Assumptions 2.2.1.(ii). are satisfied. 

Assumptions 2.2.1.(iii). is immediate since here the relaxation sequence (de- 
noted here by {iJik)k&i) is constant. Assumptions 2.2.1.(iv). is also straightfor- 
ward since the mappings '^r are orthogonal projections onto iS^, r G TZ. 

In our context, based on (l20ll . we have = t\og(t) — 1 and Assumptions 
2.2.3.(i)-(iii). are easily verified. Injectivity of the mapping t when restricted 
to U^^iQj^k is proved in [4j and thus, injectivity holds on each 9i,fe,. . . ,63,^ 
and Assumption 2.2.3. (iv) holds. 

Moreover, since tik{0) = implies that tt^ = and vr^ = implies 
for all j = 1, . . . , p and / = \, . . . ,K and 



(9) = (21) 



da' 



(9) = 0, (22) 



it follows that Psr{^tik{9*)) = ^tik{9*) if Sr is the vector space generated by 
the probability vectors vr and Ps^(Vtik{9*)) = otherwise. 

Let 9* be a cluster point in the interior of D^. Since the tik are clearly contin- 
uously differentiable around such a 9*, Corollary 1 in [7j gives that 9* satisfies 
the following property: there exists a set of subsets /* C /* and a family of real 
numbers Xij, E 2* , r E 71 such that the following Karush-Kuhn-Tucker 
condition for optimality holds at cluster point 9*: 

e vi{9*) - J2 XrdpMOl) + E Kj'^uAn, 
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which is the desired result. 



□ 



The meaning of this theorem is simply that a Karush-Kuhn- Tucker condition 
is satisfied at any cluster point in the domain of definition of the log-likelihood. 



3 Simulation results 

In this section, we address the question of testing the algorithm on simulated 
and datasets. The Space Alternating /i-EM was first tested on simulated data 
sets. The experiments were built as follows: 10 samples in were generated 
from three different gaussian distributions with the objective to recover the 
index of the distribution they were drawn from up to some index permutation. 
The class probabilities were taken as tti = .3, TT2 = .2 and tt^ = .5 and the 
variances as cr^ = 5, o"! = 7 and cr| = 10 without change through all the simu- 
lation experiments. Various experiments were performed using different values 
for the expectation vectors /ii, ^2 and /is since it could be easily suspected 
that the distance between them would play a major role in the class index 
recovery problem. The results presented below were obtained using the fol- 
lowing Monte Carlo scheme: the expectations were isotropic dilations of three 
points in M? drawn uniformly at random in the cube [— |, |]^- We ran the code 
for dilation factors d going from 10 to 100 by steps of 10. 

3.1 Two dimensional data 

An example of the type of result we obtained is given in Figure [1] below where 
the 10 points were correctly classified. 

Here is another example when the expectation vectors are chosen closer to 
each other and 8 points over 10 were correctly classified. 

Our Monte Carlo experiments are given in Figure [3l For each dilation pa- 
rameter d, 1000 Monte Carlo experiments were performed and the number of 
points correctly classified was computed by finding the best permutation of 
the set {1, 2, 3} matching the class indices obtained by the output of the space 
alternating li algorithm. 

These preliminary results show that the number of correctly recovered class 
indices increases with the dilation factor of the unit cube into which the ex- 
pectation vector were drawn uniformly at random. Moreover, the larger the 
dilation facteur, i.e. the better separated the Gaussians are, the closer to 80% 
of correctly identified class indices the space alternating li EM with data- 
driven constraint provides. 
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Fig. 1. The result obtained with the LASSO (or space alternating /i)-EM for centers 
drawn uniformly in the cube [—30,30]^. 

The number of correctly recovered class indices for 1000 Monte Carlo ex- 
periments in the case of 10 points as a function of the box into which the 
expectation vectors have been uniformly drawn are given in Table [T] below. 



Initial cube 


[-5,5]^ 


[-10,10]2 


[-15,15]2 


[-20,20]2 


[-25,25]2 


ANCRCI 


6.113 


6.982 


7.494 


7.891 


8.233 


Initial cube 


[-30,30]2 


[-35,35]2 


[-40,40]2 


[-45,45]^ 


[-50,50]2 


ANCRCI 


8.41 


8.376 


8.607 


8.712 


8.738 



3.2 Large dimensional data 



Recall that the estimation of the expectation vectors consists, for each cluster, 
of the selection of a very small number of data vectors, one linear combination 
of whose is suspected to lie near the center of the cluster. The main hope 
when we started to conceive this approach was that the dimension would 
have only a very small influence on the result and that estimation in large 
dimensions could be performed with almost as few data as in two dimensions. 
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Fig. 2. The result obtained with the space alternating h-EM for centers drawn 
uniformly in the cube [—15, 15]'^. 

In order to confirm this expectation, we performed Monte Carlo experiments 
in dimensions 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50. The results are presented 
in Table [TJ In order to compare with the standard likelihood approach for finite 
Gaussian mixtures, we gathered the results obtained for the same experiments 
in Table [2]0. 



As Table [U shows, the class index recovery rate is still quite good in dimension 
50 for well separed mixtures and confirm our primary intuition. A look at Table 
[2] shows that our method compares quite well with the standard likelihood 
approach for Gaussian mixtures estimation, especially in the higher dimensions 
where the average number of well classified data is better by often more than 
one unit. Giving a rigourous argument justifying these observations is currently 
under investigation but more experiments should be performed in order to 
explore in finer details the behavior of the method in more realistic context. 

We used the EM algorithm for Gaussian mixtures with covariance matrices equal 
to multiples of the identity matrix 
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r in 1 nlc^ 
[— lU, iUJ 


r 1 1 
[—15, loj 


r on onlc^ 
[— zU, zUJ 




r on ocWd 
[— oU, oUJ 


[— oo, ooj 


r AC\ AC\\d 

[— 4U, 4UJ 


[—45, 4oJ 


r txn txnlc^ 
[— oU, oUJ 


tt = 5 


7.U44 


7.672 


O 1 oo 

8.18z 


8.373 


O ~ AO 


o " ni 


O i^OO 

8.682 


O "7 

8.733 


8.63 


^ in 


/? /inn 
0.499 


1-7 /I no 

7.498 


T O O /I 

7.834 


O 1 'V /I 

8.174 


o /1 1 n 
8.419 


O /IOC 

8.425 


o c nn 

8.599 


O /I T1 

8.471 


o a 

8.6 


a = 15 


D.z41 


7.074 


^ r o 1 

7.581 


8.074 


o 1 r /( 

8.154 


o o /( n 

8.249 


8.347 


O O O O 

8.338 


o /loo 

8.422 


J on 


5.o7o 


6.544 


'7 oni 

7.391 


>-7 f-TOCt 

7.736 


>-7 nn 

7.905 


oni 

8.01 


o no'7 

8.037 


8.121 


O 1 0£? 

8.126 


d = 25 


5.665 


6.213 


6.796 


7.283 


7.427 


7.649 


7.613 


7.738 


7.87 


d = 30 


5.661 


6.189 


6.663 


7.063 


7.303 


7.412 


7.386 


7.576 


7.441 


d = 35 


5.684 


6.309 


6.538 


6.922 


7.132 


7.185 


7.278 


7.234 


7.396 


d = 40 


5.769 


6.296 


6.788 


6.977 


6.974 


7.218 


7.292 


7.369 


7.371 


d = 45 


5.763 


6.392 


6.842 


6.945 


7.19 


7.161 


7.334 


7.318 


7.306 


d = 50 


5.798 


6.545 


6.816 


7.007 


7.247 


7.223 


7.376 


7.369 


7.422 



Table 1. The average number of correctly recovered class indices (ANCRCI) over the 1000 Monte Carlo experiments shown in Figuremc is 
given for increasing sizes of the initial cubes where the expectation vectors are chosen uniformy at random and for increasing dimension 
of the sample space. 





r in 1 nlc^ 
[— lU, iUJ 


r 1 1 
[— io, loj 


r on onlc^ 
[— zU, zUJ 


[— zo, zoj 


r on onlc^ 
[— oU, oUJ 


r o o txi 
[— oo, ooj 


r An Ar\]d 
[— 4U, 4UJ 


r Ati A vi^d 
[—45, 4oJ 


r txn t^^\^d 

[— oU, oUJ 


tt = 5 


5.781 


b.Sob 


6.747 


6.864 


/? onn 

6.899 


^ n/^ 1 

7.061 


Til 

7.11 


T n /( 

7.04 


■"7 nTO 

7.073 


^ in 


5.79 


6.z06 


o.o4z 


o.ozz 


6.878 


O 1 o 

6.813 


6.896 


a no 1 

6.981 


T n 1 1 

7.011 


a = 15 


5.695 


6.31 


6.556 


6.646 


6.929 


r* a Ad 

6.848 


6.791 


6.897 


T nn r 

7.005 


J on 


5.d4o 


6.019 


6.374 


6.609 


6.655 


6.567 


6.86 


OOT 

6.837 


ono 

6.80z 


d = 25 


5.493 


5.932 


6.184 


6.407 


6.56 


6.606 


6.767 


6.705 


6.703 


d = 30 


5.557 


5.961 


6.169 


6.355 


6.48 


6.508 


6.492 


6.485 


6.546 


d = 35 


5.492 


5.898 


6.076 


6.173 


6.354 


6.274 


6.436 


6.24 


6.371 


d = 40 


5.514 


5.819 


6.151 


6.134 


6.195 


6.248 


6.212 


6.291 


6.437 


d = 45 


5.523 


5.854 


6.033 


6.062 


6.13 


6.141 


6.14 


6.189 


6.203 


d = 50 


5.467 


5.737 


6.039 


6.023 


6.134 


6.047 


6.091 


6.194 


6.118 



Table 2. The average number of correctly recovered class indices (ANCRCI) over the 1000 Monte Carlo experiments shown in Figuremc is 
given for increasing sizes of the initial cubes where the expectation vectors are chosen uniformy at random and for increasing dimension 
of the sample space. 




Fig. 3. For values of the dilation parameter going from 10 to 100 by steps of 10, 
the associated Monte Carlo simulation number is on the x-axis and the number of 
correctly identified class index among the 10 points is on the y-axis. 

4 Conclusion 



The goal of this paper was to propose a robust version of the maximum likeli- 
hood strategy for the estimation of finite Gaussian mixtures. Our approach is 
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based on self-regression and sparse variable selection. Sparsity was promoted 
by using an h penalty as in the LASSO. We developed a space alternating 
version of the penalized EM algrithm and proved that the interesting cluster 
points satisfy the Karush-Kuhn- Tucker optimality conditions. Our method 
was then tested on simulated datasets. In particular, the Monte Carlo exper- 
iments showed that cluster identification was more robust with our approach 
than by using the standard maximum likelihood estimator. Theoretical jus- 
tifications of these observations ought to be investigated in a near future in 
order to increase our understanding of the strengths and weaknesses of this 
approach. 

Acknowledgement. The author would like to thank Amelie Vaniscotte for 
very helpful discussions on the results of this paper. 



References 

[1] C. Biernacki, G. Celeux, G. Govaert and F. Langrognet, (2006) "Model-based 
cluster and discriminant analysis with the MIXMOD software", Computational 
Statistics and Data Analysis, Vol. 51, 2, 587-600. 

[2] C. Biernacki and S. Chretien, (2003) "Degeneracy in the maximum likelihood 
estimation of univariate Gaussian mixtures with EM", Statist. Probab. Lett. 61, 
no. 4, 373-382. 

[3] E. Candes and Y. Plan, (2009) "Near ideal model selection by li penalization". 
The Annals of Statistics, to appear. 

[4] G. Celeux, S. Chretien, F. Forbes and A. Mkhadri (2001) 'A Component-Wise 
EM Algorithm for Mixtures" Journal of Computational and Graphical Statistics, 
vol. 10, no. 4, 697-712. 

[5] S. Chretien and A. Hero, (2000) "Kullback proximal algorithms for maximum- 
likelihood estimation". Information-theoretic imaging. IEEE Trans. Inform. 
Theory 46, no. 5, 1800-1810 

[6] S. Chretien and A. Hero, (2008) " On EM algorithms and their proximal 
generalizations". ESAIM P&S 12, 308-326 

[7] S. Chretien, A. Hero and H. Perdry 

(2008) "Space Alternating Penalized Kullback Proximal Point Algorithms for 
Maximing Likelihood with Nondifferentiable Penalty", submitted. Available at 
http: //arxiv.org/abs/0901.0017 

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, (1977) "Maximum likehhood 
from incomplete data via the EM algorithm," J. Royal Statistical Society, Ser. 
B, vol. 39, no. 1, pp. 1-38. 



16 



[9] J. Fan and R. Li (2001) "Variable selection via non-concave penalized likelihood 
and its oracle properties", Journal of the American Statistical Association, 96, 
1348-1360. 

[10] J. A. Fessler, and A. O. Hero, (1994)"Space-alternating generalized expectation- 
maximization algorithm", IEEE Trans. Signal Processing, vol. 42, no. 10, pp. 
2664-2677. 

[11] M. Hurn M, A. Justel and CP. Robert, (2003) "Estimating Mixtures of 
Regressions", Journal of Computational and Graphical Statistics, 12, 55-79. 

[12] A. KhaHH and J. Chen, (2007) "Variable Selection in Finite Mixture of 
Regression Models", Journal of the American Statistical Association, Volume 
102, Number 479, pp. 1025-1038. 

[13] G.J. McLachlan and D. Peel. (2000) Finite Mixture Models. Wiley 

[14] R. Tibshirani, (1996) " Regression shrinkage and selection via the LASSO", 
Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267-288. 



17 



